
117 research outputs found

Transfer Learning for OCRopus Model Training on Early Printed Books

Author: Puppe Frank
Reul Christian
Springmann Uwe
Wick Christoph
Publication date: 01/12/2017

A method is presented that significantly reduces the character error rates of OCR text obtained from OCRopus models trained on early printed books when only small amounts of diplomatic transcriptions are available. This is achieved by building on already existing models during training instead of starting from scratch. To overcome the discrepancies between the character set of the pretrained model and that of the additional ground truth, the OCRopus code is adapted to allow for alphabet expansion or reduction: characters can now be flexibly added to or deleted from the pretrained alphabet when an existing model is loaded. For our experiments we use a self-trained mixed model on early Latin prints and the two standard OCRopus models on modern English and German Fraktur texts. The evaluation on seven early printed books showed that, compared to training from scratch, training from the Latin mixed model reduces the average number of errors by 43% with 60 lines of ground truth and by 26% with 150 lines. Furthermore, it is shown that even building on mixed models trained on data unrelated to the newly added training and test data can lead to significantly improved recognition results.
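The alphabet expansion and reduction described in this abstract can be illustrated with a minimal sketch. It is not the OCRopus implementation: the function name `adapt_output_layer`, the NumPy weight layout (one row per character), and the small random initialization for new characters are all assumptions made for illustration. Rows for characters kept from the pretrained alphabet are copied, rows for new characters are freshly initialized, and rows for deleted characters are dropped.

```python
import numpy as np


def adapt_output_layer(weights, bias, old_alphabet, new_alphabet, rng=None, scale=0.01):
    """Remap an output layer from old_alphabet to new_alphabet (hypothetical helper).

    weights: array of shape (len(old_alphabet), hidden_size), one row per character.
    bias:    array of shape (len(old_alphabet),).
    Returns new (weights, bias) with one row per character in new_alphabet.
    """
    if rng is None:
        rng = np.random.default_rng(0)

    old_index = {ch: i for i, ch in enumerate(old_alphabet)}
    hidden_size = weights.shape[1]

    # Start from small random rows; these remain only for newly added characters.
    new_weights = rng.normal(0.0, scale, size=(len(new_alphabet), hidden_size))
    new_bias = np.zeros(len(new_alphabet))

    for j, ch in enumerate(new_alphabet):
        i = old_index.get(ch)
        if i is not None:
            # Character exists in the pretrained model: reuse its learned row.
            new_weights[j] = weights[i]
            new_bias[j] = bias[i]

    # Characters present only in old_alphabet are never copied, i.e. deleted.
    return new_weights, new_bias
```

In a real recognizer the output layer would also hold an entry for the CTC blank label, which would have to be kept at its expected index; this sketch leaves that out.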

arXiv.org e-Print Archive

Directory of Open Access Journals

State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines

Author: Reul Christian
Springmann Uwe
Wick Christoph
Puppe Frank
Publication date: 01/06/2016

In this paper we evaluate Optical Character Recognition (OCR) of 19th century Fraktur scripts without book-specific training using mixed models, i.e. models trained to recognize a variety of fonts and typesets from previously unseen sources. We describe the training process leading to strong mixed OCR models and compare them to freely available models of the popular open source engines OCRopus and Tesseract as well as the commercial state-of-the-art system ABBYY. For evaluation, we use a varied collection of unseen data from books, journals, and a dictionary from the 19th century. The experiments show that training mixed models with real data is superior to training with synthetic data and that the novel OCR engine Calamari outperforms the other engines considerably, on average reducing ABBYY's character error rate (CER) by over 70%, resulting in an average CER below 1%.

Comment: Submitted to DHd 2019 (https://dhd2019.org/) which demands a... creative... submission format. Consequently, some captions might look weird and some links aren't clickable. Extended version with more technical details and some fixes to follo
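The character error rate (CER) and the relative error reduction quoted in this abstract can be computed as in the following minimal sketch. It assumes the common definition of CER as the Levenshtein (edit) distance between OCR output and ground truth divided by the ground-truth length; the abstract does not state the exact evaluation procedure, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(ground_truth, ocr_output):
    """Character error rate: edit distance normalized by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


def relative_reduction(baseline_cer, improved_cer):
    """Fraction by which improved_cer lowers baseline_cer."""
    return (baseline_cer - improved_cer) / baseline_cer
```

Under this definition, a reduction of "over 70%" means that `relative_reduction` applied to the baseline engine's CER and the improved engine's CER returns a value greater than 0.7.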

arXiv.org e-Print Archive

Research Exchange

Semi-automatic generation of test cases by case morphing

Author: Baumeister Joachim
Knauf Rainer
Puppe Frank
Publication date: 05/10/2005

Digitale Bibliothek Thüringen

Semi-automatic learning of simple diagnostic scores utilizing complexity measures

Author: Adlassnig
Adlassnig
Buscher
Eich
Frank Puppe
Freitas
Fronhöfer
Huettig
Joachim Baumeister
Martin Atzmueller
Middleton
Miller
Mitchell
Neumann
Ohmann
Paetz
Pople
Puppe
Puppe
Puppe
Schramm
Yen
Publication venue: 'Elsevier BV'

Crossref